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Abstract 

We give polynomial-time algorithms for the exact computation of lowest-energy (ground) 
states, worst margin violators, log partition functions, and marginal edge probabilities in 
certain binary undirected graphical models. Our approach provides an interesting alterna- 
tive to the well-known graph cut paradigm in that it does not impose any submodularity 
constraints; instead we require planarity to establish a correspondence with perfect match- 
ings (dimer coverings) in an expanded dual graph. We implement a unified framework 
while delegating complex but well-understood subproblems (planar embedding, maximum- 
weight perfect matching) to established algorithms for which efficient implementations are 
freely available. Unlike graph cut methods, we can perform penalized maximum-likelihood 
as well as maximum- margin parameter estimation in the associated conditional random 
fields (CRFs), and employ marginal posterior probabilities as well as maximum a posteri- 
ori (MAP) states for prediction. Maximum-margin CRF parameter estimation on image 
denoising and segmentation problems shows our approach to be efficient and effective. A 
C++ implementation is available from http://nic.schraudolph.org/isinf/. 

Keywords: Markov random fields, spin glasses, plane embedding, blossom shrinking, 
marginal posterior mode 

1. Introduction 

Undirected graphical models are a popular tool in machine learning; they represent real- 
valued energy functions of the form 

E'{y) := + E (!) 

where the terms in the first sum range over the nodes V = {1, 2, . . . n}, and those in the 
second sum over the edges £ C V x V of an undirected graph G(V, £). 

The junction tree decomposition provides an efficient framework for exact statistical 
inference in graphs that are (or can be turned into) trees of small cliques. The resulting 
algorithms, however, require time exponential in the clique size, i.e., the treewidth of the 
original graph. The treewidth of many graphs of practical interest is prohibitively large — 
for instance, it grows as 0(n) for an n x n square lattice. A large number of approximate 
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inference techniques have been developed so as to deal with such graphs, such as pseudo- 
likelihood (Besag, 1986), mean field approximation, loopy belief propagation (Weiss, 2001; 
Yedidia et al., 2003), tree reweighting (Wainwright et al., 2003, 2005), and tree sampling 
(Hamze and de Freitas, 2004). 

1.1 The Ising Model 

Efficient exact inference is possible in certain graphical models with binary node labels. 
Here we focus on Ising models, whose energy functions have the form E : {0, l} n — ► R with 

E(y) := IVi^yAEij, (2) 

where [•] denotes the indicator function, i.e., the cost Eij is incurred only in those states y 
where yi and yj disagree. Compared to the general model (1) for binary nodes, (2) imposes 
two additional restrictions: zero node energies, and edge energies in the form of disagreement 
costs. At first glance these constraints look severe; for instance, such systems must obey 
the symmetry E(y) = E(^y), where -< denotes Boolean negation (ones' complement). 
However, it is well known (e.g., Globerson and Jaakkola, 2007) that adding a single node 
makes the Ising model (2) as expressive as the general model (1) for binary variables: 

Theorem 1 Every energy function of the form (1) over n binary variables is equivalent to 
an Ising energy function of the form (2) over n + 1 variables, with the additional variable 
held constant. 

Proof by construction: Two energy functions are equivalent if they differ only by a con- 
stant. Without loss of generality, denote the additional variable yo and hold it constant at 
yo ; — 0. Given an energy function of the form (1), construct an Ising model with disagree- 
ment costs as follows: 

1. For each node energy function E[(yi), add a disagreement cost of E^ := E[(l) — £^(0), 
as shown in Figure la. Note that in both states of y^ the energy of the resulting Ising 
model is shifted relative to E[(yi) by the same constant amount, namely E[(Q)\ 
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2. For each edge energy function E[-(yi,yj), add the three disagreement cost terms 

En := lKE'^0, 1) + E'^l, 0)) - (£^(0, 0) + E'^l, 1))], 

E 0i := £^.(1, 0) - EijiO, 0) - E ih and (3) 

E 0J := E'^0,1) -EijM-Eij, 

as shown in Figure lb. Note that for all states of yi and yj, the energy of the resulting 
Ising model is shifted relative to E[(yi) by the same constant amount, namely £L(0, 0): 
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(a) (b) (c) (d) 

Figure 1: Equivalent Ising model (with disagreement costs) for a given (a) node energy E[, 
(b) edge energy E[- in a binary graphical model; (c) equivalent submodular model 
if Eij > and > but Eqj < 0; (d) equivalent directed model of Kolmogorov 
and Zabih (2004, Fig. 2d). 
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Summing the above terms, the total bias of node i (i.e., its disagreement cost with the bias 
node) is 

E 0i = e[(\) - Em + i E ij^ °) - E ij^ °) - E *s\ - ( 4 ) 

This construction defines an Ising model whose energy in every configuration y is shifted, 
relative to that of the general model we started with, by the same constant amount, namely 

E'(oy. 

= E'(y)-J2E'M -J2 ^;(°>°) 

= E\y) - E\0). (5) 
The two models' energy functions are therefore equivalent. ■ 

Note how in the above construction the label symmetry E(y) — E(-^y) of the plain Ising 
model (2) is conveniently broken by the introduction of a bias node, through the convention 
that yo := 0. 

1.2 Energy Minimization via Graph Cuts 

Definition 2 The cut C of a binary graphical model G(V,£) induced by state y E {0, l} n 
is the set C(y) := E £ : y% ^ yj}; its weight \C(y)\ is the sum of the weights of its 

edges. 



Vye{0,l} n : E 
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Any given state y partitions the nodes of a binary graphical model into two sets: those 
labeled '0', and those labeled '1'. The corresponding graph cut is the set of edges crossing 
the partition; since only they contribute disagreement costs to the Ising model (2), we 
have Vy : \C(y)\ = E(y). The lowest-energy state of an Ising model therefore induces its 
minimum- weight cut. Conversely, the ground state of an Ising model (in absence of a bias 
node: up to label symmetry) can be determined from its minimum-weight cut via a simple 
(e.g., depth- first) graph traversal (see Algorithm 1). 

Minimum- weight cuts can be computed in polynomial time in graphs whose edge weights 
are all non- negative. Introducing one more node, with the constraint y n +i '•— 1, allows us 
to construct an equivalent energy function by replacing each negatively weighted bias edge 
Eoi < by an edge to the new node n + 1 with the positive weight := — Eoi > 

(Figure lc). This still leaves us with the requirement that all non-bias edges be non- 
negative. This submodularity constraint implies that agreement between nodes must be 
locally preferable to disagreement — a severe limitation. 

The now widespread use of graph cuts in machine learning to find lowest-energy con- 
figurations, in particular in image processing, was pioneered by Greig et al. (1989). Our 
construction (Figure lc) differs from that of Kolmogorov and Zabih (2004) (Figure Id) in 
that we do not employ the notion of directed edges. (In directed graphs, the weight of a 
cut is the sum of the weights of only those edges crossing the cut in a given direction.) 
We note that a more elaborate construction can give partial answers in graphs with some 
negative edge weights (Kolmogorov and Rother, 2007; Rother et al., 2007b), and that a 
sequence of expansion moves (energy minimizations in binary graphs) can efficiently yield 
an approximate answer for graphs with discrete but non-binary node labels (Boykov et al., 
2001). 

The remainder of this paper is organized as follows: Section 2 describes the planarity 
and connectivity conditions that our approach imposes upon the graphs, and how we handle 
them. Section 3 describes the construction of an expanded dual graph, and the algorithm 
for computing optimal (ground) states of the Ising model from it. The calculation of the 
partition function and marginal probabilities is dealt with in Section 4. These algorithms are 
then used in Section 5 to implement maximum-likelihood and maximum-margin parameter 
estimation in conditional random fields (CRFs). Section 6 describes our experiments using 
grid CRFs for image denoising and boundary detection, and Section 7 concludes with a 
discussion and outlook. We are making open source C++ code implementing our algorithms 
freely available for download from http://nic.schraudolph.org/isinf/. 

2. Planarity and Connectivity 

Unlike graph cut methods, the inference algorithms we will describe do not depend on 
submodularity; instead they require that the model graph be planar, and that a planar 
embedding be provided. They may also work only for connected graphs, and be the most 
efficient only for biconnected graphs. In this section we review these concepts, discuss their 
implications for our approach, and describe how to best handle them. 
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(a) (b) (c) (d) 

Figure 2: (a) a non-plane drawing of a planar graph; (b) a plane drawing of the same graph; 

(c) a different plane drawing of same graph, with the same planar embedding as 
(b); (d) a plane drawing of the same graph with a different planar embedding. 



2.1 Embedding Planar Graphs 

Definition 3 A graph is planar if it can be drawn in the plane R 2 without edge inter- 
sections. The regions into which such a plane drawing partitions M 2 are the faces of the 
drawing; the unbounded region is the external face. 

The operational nature of this definition would suggest that our algorithms must produce 
(or have access to) a plane drawing of the model graph. This is unsatisfactory in that 
such a drawing contains much information (such as the precise location of the vertices, and 
the exact shape of the edges) that we will not need. All we care about is the cyclic (say, 
clockwise) ordering of the edges incident upon each vertex. In topological graph theory, 
this is formalized in the notion of a rotation system (White and Beineke, 1978, p. 21f): 

Definition 4 Let G(V, £) be an undirected, connected graph. For each vertex i E V, let 
Si denote the set of edges in £ incident upon i, considered as being oriented away from i, 
and let i\{ be a cyclic permutation of £{. A rotation system for G is a set of permutations 

n = {m : i e V}. 

To define the sets £{ of oriented edges more formally, construct the directed graph G(V, £'), 
where £' contains a pair of directed edges (known as edgelets) for each undirected edge in 
£, that is, (ij) e £' « [(ij) e £ v Cm) e £}. Then Si = {(j,k) e £' : i = j}. 

Rotation systems directly correspond to topological graph embeddings in orientable 
surfaces: 

Theorem 5 Each rotation system determines an embedding ofG in some orientable surface 
S such that\/i E V, any edge (z, j) E £% is followed byni(i,j) in (say) clockwise orientation, 
and such that the faces T of the embedding, given by the orbits of the mapping (i,j) — > 
TTj(j,i), are 2- cells (topological disks). 

Proof see White and Beineke (1978, p. 22f). ■ 
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Note that while in graph visualisation "embedding" is often used as a synonym for 
"drawing" , in modern topological graph theory it stands for "rotation system" . We adopt 
the latter usage, which views embeddings as equivalence classes of graph drawings charac- 
terized by identical cyclic ordering of the edges incident upon each vertex. For instance, 
7T4(4, 5) = (4,3) in Figures 2b and 2c (same embedding) but 7^(4,5) = (4, 1) in Figure 2d 
(different embedding). A sample face in Figures 2b-2d is given by the orbit 

(4,1) - 7ri(l,4) = (1,2) - tt 2 (2,1) = (2,4) - 7r 4 (4,2) = (4,l). 

The genus g of the embedding surface S can be determined from the Euler characteristic 

|V|-|f| + |^| =2-2(7, (6) 

where \T\ is found by counting the orbits of the rotation system, as described in Theorem 5. 
Since planar graphs are exactly those that can be embedded on a surface of genus g — (a 
topological sphere), we arrive at a purely combinatorial definition of planarity: 

Definition 6 A graph G(V, £) is planar iff it has a rotation system II producing exactly 
2 + \£\ — \V\ orbits. Such a system is called a planar embedding of G, and G(V,£,H) is 
called a plane graph. 

Our inference algorithms require a plane graph as input. In certain domains (e.g., when 
working with geographic information) a plane drawing of the graph (from which the corre- 
sponding embedding is readily determined) may be available. Where it is not, we employ 
the algorithm of Boyer and Myrvold (2004) which, given any connected graph G as input, 
produces in linear time either a planar embedding for G or a proof that G is non-planar. 
Source code for this step is freely available (Boyer and Myrvold, 2004; Windsor, 2007). 

2.2 The Planarity Constraint 

In Section 1.1 we have mapped a general binary graphical model to an Ising model with 
an additional bias node; now we require that that Ising model be planar. What does that 
imply for the original, general model? If all nodes of the graph are to be connected to the 
bias node without violating planarity, the graph has to be outerplanar, i.e., have a planar 
embedding in which all its nodes lie on the external face — a very severe restriction. 

The situation improves, however, if we do not insist that all nodes be connected to the 
bias: If only a subset B C V of nodes have non-zero bias (4), then the graph only needs 
to be £>-outerplanar, i.e., have a planar embedding in which all nodes in B lie on the same 
face. Model selection may thus entail the step of picking the face of a suitably embedded 
planar Ising model whose nodes will be connected to the bias node. In image processing, 
for instance, where it is common to operate on a square grid of pixels, we can permit bias 
for all nodes on the perimeter of the grid, which borders the external face. 

In general, a planar embedding which maximizes a weighted sum over the nodes bor- 
dering a given face can be found in linear time (Gutwenger and Mutzel, 2004); by setting 
node weights to some measure of their bias, such as the magnitude or square of Ecu (4), 
we can thus efficiently obtain the planar Ising model closest (in that measure) to any given 
planar binary graphical model. 
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In contrast to submodularity, £>-outerplanarity is a structural constraint. This has the 
advantage that once a model obeying the constraint is selected, inference (e.g., parameter 
estimation) can proceed via unconstrained methods (e.g., optimization). 

Finally, we note that all our algorithms can be extended to work for non-planar graphs as 
well. They then take time exponential in the genus of the embedding though still polynomial 
in the size of the graph; for graphs of low genus this may well be preferable to current 
approximative methods. 

2.3 Connectivity 

All algorithms in this paper assume that the graph G(V, £) is connected, i.e., that £ contains 
at least one path between any two nodes of V. Where this is not the case, one can simply 
determine the connected components 1 of G in linear time (Hopcroft and Tarjan, 1973), 
then invoke the algorithm in question separately on each of them. Since each component is 
unconditionally independent of all others (as they have no edges between them), the results 
can be trivially combined. Specifically, 

• G is planar iff all of its connected components are planar; any concatenation of a 
planar embedding for each connected component is a planar embedding of G. 

• Any concatenation of a ground state for each connected component of G is a ground 
state of G. 

• The log partition function of G is the sum of the log partition functions of its connected 
components. 

• The edge marginal probabilities of G are the concatenation of edge marginal proba- 
bilities of its connected components. 

2.4 Biconnectivity 

Definition 7 A graph is biconnected iff it is connected and does not have any articulation 
vertex. An articulation vertex is a vertex whose removal (along with any incident edges) 
disconnects the graph. 

Although the algorithms in this paper do not require G(V, £) to be biconnected, simpler 
and more efficient alternatives are applicable when this is not the case. As Figure 3 illus- 
trates, any graph G can be decomposed into a tree of edge-disjoint biconnected components 1 
which overlap in the articulation vertices (of G) they share; this decomposition can be per- 
formed in linear time (Hopcroft and Tarjan, 1973). We make the following key observation: 

Theorem 8 Let £\, £2, . . . ,£ n be the edge sets comprising the biconnected components of a 
graph G(V, £). The probability of cuts induced by the states of a Markov random field (16) 

1. A component of a graph with respect to a given property is a maximal subgraph that has the property. 
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Figure 3: Skeletal chemical structures (images courtesy of Wikipedia) of phosphorus trioxide 
(bottom left), nitroglycerine (left of center), and quinine (right), with decomposi- 
tion into a tree of biconnected components indicated (shaded ovals). Phosphorous 
trioxide is biconnected (i.e., all one component); nitroglycerine is a tree (i.e., has 
only trivial biconnected components). 

over an Ising model 2 (2) on G factors into that of its biconnected components: 

n 

nc{y)] = \{nc{y)^£k]. (7) 

k=l 



Proof If G(V, £) is biconnected, n — 1 and £\ — £, making Theorem 8 trivially true. Oth- 
erwise split G into a biconnected component Gi(Vi, £\) which is a leaf of the decomposition 
tree, and the remainder G f (V f ,£ f ). It is always possible to find such a split. Because it is 
a leaf, G\ connects to G' through a single articulation vertex of G, which we call v\. To 
summarize: 

ViUV' = V, ViHV^^i}, 

Si U £' = S, £1 H £' = 0, Gi biconnected. (8) 

Let yi and y' be the state y of the model restricted to vertices in V\ and V', respectively. 
By definition, y\ is independent of y' when both are conditioned on the state y Vl of the 
articulation vertex connecting them. We now make use of the label symmetry of Ising 
models, resp. the fact that it is broken by conditioning on a single node: P[C(j/)] = F(y\yi) 
for any choice of i. We therefore have 

F[C(y)] = F(y\y vi ) = ¥( yi ,y'\y vi ) 

= F( yi \y Vl )F(y'\y Vl ) = F[C(y) n £1] P[C(y) n £']. (9) 

2. Recall that we consider an Ising model to be an undirected graphical model with binary node states, no 
node potentials, edge potentials in the form of disagreement costs, and an optional bias node. 
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Recursively applying this argument to G' yields Theorem 8. ■ 

The independence of biconnected components with respect to cuts C(y) stated by The- 
orem 8 may come as a surprise, given that the corresponding node states y do correlate 
through the articulation vertices. What happens is that label symmetry — specifically, the 
fact that C(y) = C(-*y) — folds the state space so as to exactly cancel any moments between 
biconnected components. By decoupling their biconnected components, this facilitates effi- 
cient inference in graphs that are not biconnected themselves. 

2.4.1 Ising Trees 

Note in Figure 3 how edges which are not part of any cycle of G form trivial biconnected 
components comprising only themselves and the two articulation vertices they connect. At 
the most extreme, a tree T does not contain any cycles, hence consists entirely of trivial 
biconnected components (Figure 3, left of center). Theorem 8 then implies that each edge 
can be considered individually, making inference in an Ising tree T(V, £) very simple: 

• T is planar; any embedding of T is a planar embedding. 

• The minimum-weight cut of T is the set £~ := E £ : Eij < 0}. 
(Use Algorithm 1 to obtain the corresponding ground state.) 

• The log partition function of T is 

InZ = In JJ (e° + e-^) = ^ ln^ + e"^)- ( 10 ) 
(ij)es (i,j)es 

• The marginal probability of any edge of T is 

P[(i,i)eC] = e ~ E l E .. = — (11) 

2.4.2 The General Case 

The most efficient way to employ the inference algorithms in this paper on graphs G that 
are neither biconnected nor trees (e.g., Figure 3, right) is to apply them to each nontrivial 
biconnected component of G in turn, then use Theorem 8 to combine the results, along with 
the simple solutions given in Section 2.4.1 above for trivial biconnected components, into 
a result for the full graph. Letting £ T C £ denote the set of edges that belong to trivial 
biconnected components of G, we have: 

• A planar embedding of G is obtained by arbitrarily combining a planar embedding 
for each of its nontrivial biconnected components and the edges in £ T . 

• A minimum- weight cut of G is the union between the edges in £~D £ T and a minimum- 
weight cut for each of G's nontrivial biconnected components; Algorithm 1 can be used 
to obtain the corresponding ground state. 
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• The log partition function of G is the sum of the log partition functions of its nontrivial 
biconnected components, plus J2(ij)eS r m (l + e~ Eij )- 

• The marginal probability of an edge (z, j) G £ is (1 + e Ei ^)~ l if (z, j) E £ T , or com- 
puted (by the method of Section 4) from the biconnected component it belongs to. 

This concludes our discussion of conditions on the Ising model graph G. In what follows, 
we assume that G is connected and planar, and that a plane embedding is provided. We 
do not require that G is biconnected, though where this is not the case, it is generally more 
efficient to decompose G into biconnected components as discussed above. 

3. Computing Ground States via Maximum- Weight Perfect Matching 

Definition 9 A frustrated cycle O C £ of a graph G(V, £) with non-zero edge weights E 
is a simple cycle whose product of edge weights is negative, i.e., Yl(i j)eO ^ij < 0. (A simple 
cycle is a closed path with no repeated edges or vertices.) 

A weighted graph is said to be frustrated if it contains any frustrated cycles. Note that 
trees can never be frustrated because they do not contain any cycles to begin with. 

The lowest-energy (ground) state := argmin y £?(?/) of an unfrustrated Ising model is 
easily found via essentially the same method as in a tree (Section 2.4.1): paint nodes as you 
traverse the graph, flipping the binary color of your paintbrush whenever you traverse an 
edge with negative disagreement cost (as done by Algorithm 1 below when invoked on the 
cut C = E £ : Eij < 0}). This cannot lead to a contradiction because by Definition 9 

you will flip brush color an even number of times along any cycle in the graph, hence always 
end a cycle on the same color you started it with. 

The presence of frustration unfortunately makes the computation of ground states much 
harder — in fact, it is known to be NP-hard in general (Barahona, 1982). As shown below, 
the ground state of a planar Ising model can be computed in polynomial time via maximum- 
weight perfect matching on an expanded dual of its embedded graph. 

A relationship between the states of a planar Ising model and perfect matchings ("dimer 
coverings" to physicists) was first established by Kasteleyn (1961, 1963, 1967) and Fisher 
(1961, 1966). Perfect matchings in dual graph constructs were used by Bieche et al. (1980) 
and Barahona (1982) to compute Ising ground states; below we generalize a simpler con- 
struction for triangulated graphs due to Globerson and Jaakkola (2007). For rectangular 
lattices our approach reduces to the construction of Thomas and Middleton (2007), though 
their algorithm to compute ground states is somewhat less straightforward. Pardella and 
Liers (2008) apply this construction to very large square lattices (up to 3000x3000 nodes), 
and find it to be more efficient than earlier methods (Bieche et al., 1980; Barahona, 1982). 

3.1 The Expanded Dual Graph 

Definition 10 The dual G*(J r , £) of an embedded graph G(V,£,II) has a vertex for each 
face ofG, with edges connecting vertices corresponding to faces that are adjacent (i.e., share 
an edge) in G. 
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Figure 4: Left column: Square face of a plane graph (dashed lines) with its ordinary (top) 
resp. expanded (bottom) dual graph (solid lines). Other columns: Binary node 
states (open and filled circles) on the graph, induced cuts (bold blue dashes), and 
complementary perfect matchings (bold red lines) of the expanded dual. 



Figure 4 (top left) shows the dual for a square face of our plane graph. Each edge of the dual 
crosses exactly one edge of the original graph; due to this one-to-one relationship we will 
consider the dual to have the same set of edges £ (with the same energies) as the original. 
Since the nodes in the dual are different, however, we will (with some abuse of notation) use 
a single index for dual edges and their weights, corresponding to the index of the original 
edge in some arbitrary ordering of £. Thus if (i, j) is the k th edge in the (ordered) £, its 
dual will have weight '= Eij. 

We now expand the dual graph by replacing each of its nodes with a (/-clique, where q is 
the degree of the node, as shown in Figure 4 (bottom left) for a dual node with degree q — 4, 
and in Figure 5 for an entire graph. The additional edges internal to each (/-clique are given 
zero energy so as to leave the model unaffected. For large q the introduction of these 0(q 2 ) 
internal edges slows down subsequent computations (solid line in Figure 8, left); this can be 
avoided by subdividing the offending (/-gonal face of the model graph with chords (which 
are also given zero energy) before constructing the dual. Our implementation performs best 
when "octangulating" the graph, i.e., splitting octagons off all faces with q > 13; this is 
more efficient than a full triangulation (Figure 8, left). 

It is easily seen that the expanded dual has 2\£\ vertices, two for each edge in the original 
graph. We therefore give the two vertices connected by the dual of the k th edge in £ the 
indices 2k — 1 and 2k (cf. Section 4.3.1 and Figure 7). This consistent indexing scheme 
allows us to run the inference algorithms described in the remainder of this paper without 
explicitly constructing an expanded dual graph data structure. 
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Figure 5: Example of a planar Ising model with five binary nodes (large, blue) including a 
constant- valued bias node (top left), and its expanded dual (small nodes, red), in 
two different states (left & right). The graph cut induced by the given state and 
its complementary perfect matching of the expanded dual are shown in bold, as 
are the nodes left unmatched by the complement M' := £\C of the cut in the 
expanded dual. 



3.2 Complementary Perfect Matchings 

Definition 11 A perfect matching of a graph G(V,£) is a subset M C £ of edges wherein 
exactly one edge is incident upon each vertex: \/v G V, \v\ — 1 in G(V,M). Its weight \M\ 
is the sum of the weights of its edges. 

Figure 5 shows two perfect matchings (in bold) of the nodes of the expanded dual of an 
Ising model. There is a complementary relationship between such perfect matchings and 
graph cuts in the original Ising model, characterized by the following two theorems. The 
reader may find it helpful to refer to Figure 5 while going through the proofs. 

Theorem 12 For every cut C of an embedded graph G(V,£,II) there exists at least one (if 
G is triangulated: exactly one) perfect matching Ai of its expanded dual complementary to 
C, i.e., £\M =C. 

Proof By construction, the set £ of edges of G constitutes a perfect matching of the 
expanded dual. Any subset of £ therefore is a (possibly partial) matching of the expanded 
dual. The complement M r := £\C of a cut of G is a subset of £ and thus a matching of the 
expanded dual; it obeys £\M f — £\(£\C) = C. The nodes that M' leaves unmatched in 
the expanded dual (circled bold in Figure 5) are those neighboring the edges C of the cut. 

By definition, C intersects any cycle of G, and therefore also the perimeters of G's faces 
J 7 , in an even number of edges. In each clique of the expanded dual, M f thus leaves an 
even number of nodes unmatched. It can therefore be completed into a perfect matching 
M 2 M' using only edges interior to the cliques to pair up unmatched nodes. These edges 
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have no counterpart in the original graph: (M\M r ) fl £ = 0. We thus have 

£\M = £\[(M\M')UM'] = [S\(M\M')]\M' = £\M' = C. (12) 

In a 3-clique of the expanded dual, Ai' will leave two nodes unmatched or none at all; in 
either case there is only one way to complete the matching, by adding one edge resp. none 
at all. By construction, if G is triangulated all cliques in its expanded dual are 3-cliques, 
and so M is unique. ■ 

In other words, there exists a surjection from perfect matchings in the expanded dual of G 
to cuts in G. Furthermore, since we have given edges interior to the cliques of the expanded 
dual zero energy (i.e., \M\M.'\ = 0), every perfect matching M complementary to a cut C 
of our Ising model (2) obeys the relation 

|M| + |C| = \M'\ + \C\ = Yl Ei i = const ' ( 13 ) 

This means that instead of a minimum- weight cut in a graph we can look for a maximum- 
weight perfect matching in its expanded dual. But will that matching always be comple- 
mentary to a cut? The following theorem shows that this is true for plane graphs: 

Theorem 13 Every perfect matching Ai of the expanded dual of a plane graph G(V, £, II) 
is complementary to a cut C of G, i.e., £\M = C. 

Proof By definition, £\M is a cut of G iff it intersects every cycle O C £ of G an even 
number of times. This can be shown by induction over the faces of the embedding: 

Base case — let O C £ be the perimeter of a face of the embedding, and consider 
the corresponding clique of the expanded dual: M matches an even number of nodes in 
the clique via interior edges; all other nodes must be matched by edges crossing O. The 
complement of the matching in G thus intersects O an even number of times: 

|(£VW)nO|=0 mod 2. (14) 

Induction — let 0\,0 2 Q £ be cycles in G obeying (14), and consider their symmetric 
difference 1 A0 2 := {O x U 2 )\(0 1 fl 2 ): 

\(e\M)n(0 1 A0 2 )\ = \[(£\M)n(0 1 uo 2 )]\(0 1 no 2 )\ 

= \[(£\M) n d] u [(£\M) n 2 ]\ - \{£\M) n [O x n 2 )\ 
= \(£\M) n Oi| + \(£\M) n 2 \ - 2\{£\M) n [p x n 2 )\ 

= + 0-2n = mod 2. (n e N) (15) 

We see that property (14) is preserved under composition of cycles via symmetric differences, 
and thus holds for all cycles that can be composed from face perimeters of the embedding 
of G in this fashion. 

All cycles in a plane graph G are contractible on its embedding surface (a plane resp. 
sphere) because that surface has zero genus, and is therefore simply connected. All con- 
tractible cycles of G can be constructed by composition of face perimeters via symmetric 
differences, thus have an even intersection with £\M, which is therefore a cut. ■ 
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This is where planarity matters: Surfaces of non-zero genus are not simply connected, and 
thus non-plane graphs may contain non-contractible cycles (e.g., encircling the hole of a 
torus). In the presence of frustration, our construction does not guarantee that the comple- 
ment £\M of a perfect matching of the expanded dual contains an even number of edges 
along such cycles. For planar graphs, however, the above theorems allow us to leverage 
known polynomial-time algorithms for perfect matchings into inference methods for Ising 
models. This approach also works for non-planar Ising models that contain no frustrated 
non-conctractible cycle. 

We note that if all cliques of the expanded dual are of even size, there is also a direct (non- 
complementary) surjection from perfect matchings to cuts in the original graph. In contrast 
to our complementary map, the direct surjection requires the addition of dummy vertices 
into the expanded dual for faces of G with odd perimeter so as to make the corresponding 
cliques even (Kasteleyn, 1963; Fisher, 1966; Liers and Pardella, 2008). 

3.3 Computing the Ground State 

The blossom-shrinking algorithm (Edmonds, 1965a, b) is a sophisticated method to effi- 
ciently compute the maximum-weight perfect matching of a graph. It can be implemented 
(Mehlhorn and Schafer, 2002) to run in as little as 0(\£\ \V\ log |V|) time (Galil et al., 1986). 
Although the Blossom IV code we are using (Cook and Rohe, 1999) is asymptotically less 
efficient — 0(\£\ |V| 2 ) — we have found it to be very fast in practice (Figure 8, left). 

We can now efficiently compute the lowest-energy state of a planar Ising model as follows: 
Find a planar embedding of the model graph (Section 2.1), construct its expanded dual 
(Section 3.1), and run the blossom-shrinking algorithm on that to compute its maximum- 
weight perfect matching. Its complement in the original model is the minimum- weight graph 
cut (Section 3.2). We can identify the state which induces this cut via a depth- first graph 
traversal (Algorithm 1) that labels nodes as it encounters them, and checks for consistency 
on subsequent encounters. The traversal starts by labeling the bias node with its known 
state yo := 0. 

4. Computing the Partition Function and Marginal Probabilities 

A Markov random field (MRF) over our Ising model (2) models the distribution 

P( y ) = i e -*), where Z :=^e~ E ^ (16) 

y 

is the MRF's partition function. As it involves a summation over exponentially many states 
y, calculating the partition function is generally intractable. For planar graphs, however, 
the generating function for perfect matchings can be calculated in polynomial time via 
the determinant of a skew-symmetric matrix (Kasteleyn, 1961, 1963, 1967; Fisher, 1961, 
1966), which we call the Kasteleyn matrix K. Due to the close relationship with graph cuts 
(Section 3.2) we can apply this method to calculate Z in (16). We first convert a planar 
embedding of the Ising model graph into a Boolean "half-Kasteleyn" matrix in four 
steps which will be elaborated below: 
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Algorithm 1 Find State from Corresponding Graph Cut 

Input: Ising model graph G(V,£), g r &ph cut C(y) C £ 

1. Vz £ {0, 1, 2, ... n} : ^ := unknown; 

2. dfs_state(0, 0); 
Output: state vector y 

procedure dfs_state(z E {0, 1,2,... n}, 5 G {0, 1}) 
if yi = unknown then 

1. yi := 5; 

2. V(i,j)e£: 

if GC then 

dfs_state(j, ->s); 
else dfs_state(j, 5); 
else assert ^ = s; 



1. Section 4.1, Algorithm 2: pZcme triangulate the embedded graph so as to make the 
relationship between cuts and complementary perfect matchings a bijection (cf. Sec- 
tion 3.2); 

2. Section 4.2, Algorithm 3: orient the edges of the graph such that the in-degree of 
every node is odd; 

3. Section 4.3.3, Algorithm 4: construct the Boolean half-Kasteleyn matrix H from the 
oriented graph; 

4. Section 4.4.3: prefactor the triangulation edges (added in Step 1) out of H. 

Our Step 2 simplifies equivalent operations in previous constructions (Kasteleyn, 1963, 1967; 
Fisher, 1966; Globerson and Jaakkola, 2007), Step 3 differs in that it only sets unit (ie., 
+1) entries in a Boolean matrix, and Step 4 can dramatically reduce the size of H for 
compact storage (as a bit matrix) and faster subsequent computations (Figure 8). 

Edge disagreement costs do not enter into H\ they are only taken into account in a 
second phase, when the full Kasteleyn matrix K is constructed from H (Section 4.3.2). We 
can then factor K (Section 4.4) and compute the partition function from its determinant 
(Section 4.4.1; Kasteleyn, 1961; Fisher, 1961). This factorisation can also be used to invert 
K, which is required to obtain marginal probabilities of disagreement on the edges of the 
model graph (Section 4.4.2). 

In what follows, we elaborate in turn on the graph operations of plane triangulation 
(Section 4.1) and odd edge orientation (Section 4.2), and construction (Section 4.3) and 
factoring (Section 4.4) of the Kasteleyn matrix K resp. H. 
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(a) (b) (c) (d) 

Figure 6: (a) A chordal graph whose external face is not triangulated; (b) a plane trian- 
gulated graph that has a 4-cycle (bold blue) without a chord; (c) proper, (d) 
improper plane triangulation (dashed) of the plane graph from Figure 2d. 



4.1 Plane Triangulation 

Definition 14 An embedded graph is plane triangulated iff it is biconnected and each of 
its faces (including the external face) is a triangle. 

Note that plane triangulation is not equivalent to making a graph chordal, though the latter 
process is sometimes also called "triangulation". For instance, the graph in Figure 6a is 
chordal but not plane triangulated because the external face is not triangular, while that 
in Figure 6b is plane triangulated but not chordal because it contains a 4-cycle (bold blue) 
that has no chord. 

We can plane triangulate an embedded graph in linear time by traversing all of its 
faces, inserting chords as necessary as we go along (Algorithm 2). This may create multiple 
edges between the same two vertices, as shown in Figure 6c. Care must be taken when 
encountering singly connected components with a perimeter of length two, which could 
cause the insertion of a self-loop (see Figure 6d). Algorithm 2 detects and biconnects such 
components, as Definition 14 requires. 

The insert _chord(z, j, k) subroutine of Algorithm 2 not only updates £, but also 7V{ and 
TTk so as to insert the new edge in its proper place in the rotation system. In order 

to leave the distribution (16) modeled by the graph unchanged, the new edge is given zero 
energy. Repeated traversals of the same face in Algorithm 2 can be avoided by the obvious 
use of "done" flags, omitted here for the sake of clarity. 

Note that plane triangulation is not strictly necessary for the computation of partition 
function or marginal probabilities; it merely simplifies subsequent steps in the construc- 
tion. Previously (Fisher, 1966; Globerson and Jaakkola, 2007) this convencience came at 
a computational price: the edges added during plane triangulation can make factoring and 
inversion of K (Section 4.4) significantly (up to 20 times) more expensive. We avoid this 
cost by removing the triangulation edges again before constructing during prefactoring 
of the half-Kasteleyn bit matrix H (Section 4.4.3). 
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Algorithm 2 


Plane Inangulation 


Input: 


plane graph G(V,£,II) with V > 3 




Vz G V : 




V(i, j) G : 




1. (j,k) := 7Tjf (jf, z) 5 




2. (fc,Z) :=7r fc (fc,j); 




3. while Z^z V tt z (Z, fc) ^ (/, j) 




(a) if z = then (avoid self-loop) 




i := j; 




j := fc; 












(b) insert_chord(z, j, fc); 




(c) z := fc; j := /; 




(d) := 7Tj(j, z); 




(e) (fc, /) := 7v k (k,j)] 


Output: 


plane triangulated graph G(V,£,II) 


procedure 


insert _chord(z, j, fc G V) 




1. c := c U {(z, /cjj; 




2. 7r k (k,i) := 7Tjfe(fc,j); 




3. 7Tjfe(fc,j) := (M); 




4. Tr^Trr 1 ^^-)) := (i,fc); 




5. 7Ti(i,A;) := (z, j); 




6. := 0; 



4.2 Odd Edge Orientation 

To calculate the generating function for perfect matchings, the graph in question (namely, 
the expanded dual of our model graph) must be given a clockwise odd orientation. 

Definition 15 An orientation of an undirected graph G(V, £) is a set £' of oriented edges 
with \£'\ = \£\ such thatV(i,j) G £, £' contains either (z,j) or (j,z). 

Definition 16 (Kasteleyn, 1963) An orientation of an embedded graph is clockwise odd 
iff the number of edges oriented clockwise around each face (except possibly the external 
face) is odd. 

Consider Figure 7: by giving all interior edges of the 3-cliques of the expanded dual 
a clockwise orientation (small red arrows), we ensure that (a) the interior faces of the 
3-cliques have clockwise odd orientation, and (b) all interior edges of the 3-cliques are 
oriented counterclockwise wrt. all faces exterior to the 3-cliques, hence do not affect the 
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Figure 7: Clockwise odd orientation (Section 4.2) and indexing scheme (Section 4.3.1) for 
the expanded dual (red, small nodes) of the model graph (blue, large nodes). 



latters' clockwise odd orientation status. It remains to consider the orientation of edges 
external to the 3-cliques (large red arrows). What does a clockwise odd orientation of these 
edges correspond to in the original model graph? To map these edges back into the model 
graph, rotate them clockwise by 90 degrees. A face with clockwise odd orientation of its 
perimeter in the dual thus maps to a vertex with an odd in-degree, i.e., an odd number 
of edges oriented towards it. This facilitates a drastic simplification of this step in our 
construction: 

To establish a clockwise odd orientation of the expanded dual, simply orient the edges of the 
model graph such that all vertices, except possibly one, have an odd in-degree. 

Algorithm 3 achieves this in linear time by orienting edges appropriately upon return from 
a depth-first traversal of the graph. In contrast to earlier constructions (Kasteleyn, 1963; 
Fisher, 1966; Globerson and Jaakkola, 2007), it does not require following orbits around 
faces, and in fact does not refer to an embedding II or dual graph G* at all. 

Any vertex can be chosen to be the exceptional vertex s, since the choice of external face 
of a plane drawing is arbitrary — it is an artifact of the drawing, not an intrinsic property 
of the embedding: a planar graph embedded on a sphere has no external face. 

4.3 Constructing the Kasteleyn Matrix 

The Kasteleyn matrix K is a skew-symmetric, 2\£\ x 2\£ | matrix constructed from the Ising 
model graph whose determinant is the square of the partition function. Our construction 
improves upon the work of Globerson and Jaakkola (2007) in a number of ways: 
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Algorithm 


3 Construct Odd Edge Orientation 


Input: 


undirected graph G(V, £) 


i 

i . 


v v kz v . i;.visueQ — raise, 


2. 


pick arbitrary edge (r, s) G £; 


3. 


£' :={(r,5)}; 


4. 


make_odd(r, 5); 


Output: 


orientation £' of £ : Vi; G V\{s}, 




in-degree(i;) = 1 mod 2 in G(V, £') 


function 


make_odd: (^, uEV)^ {true, false} 


1. 


£ := 


2. 


if i;. visited then return true; 


3. 


v. visited := true; 


4. 


odd := false; 


0. 


V W t K • \c7, 7/7 1 t C 




if make_odd(i;, it;) then 




(a) £':= 5' U {(«;,«)}; 




(b) odd := ^odd; 




else £' :=£'U{(v,w)}] 


6. 


return odd; 



• We employ an indexing scheme that obviates any need to refer to the expanded dual 
of the model graph (which we consequently never explicitly construct at all); 

• We break construction of the Kasteleyn matrix into two phases, the first of which is 
invariant with respect to the model's disagreement costs; 

• We make the "half-Kasteleyn" matrix H computed in the first phase very compact 
by prefactoring out the triangulation edges (see Section 4.4.3) and storing it as a bit 
matrix. 



4.3.1 Indexing Scheme 

Without loss of generality, let £ = {ei,e2, . . . e\g\}- Note that the expanded dual has 2\£\ 
vertices, one lying to either side of every edge in the model graph. When viewing edge 
ek along its direction in £' \ we label the dual node to its right 2k — 1 and that to its 
left 2k; see Figure 7 for an example (cf. Section 3.1). One benefit of this scheme is that 
quantities relating to the edges of the model graph (as opposed to internal edges of the 
cliques of the expanded dual) will always be found on the superdiagonal of K. We also 
notationally extend the rotation system from Section 2.1 to support this indexing scheme: 
ek = (hj) e^*.) = 7Ti(iJ). 
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ring size (nodes) ring size (nodes) ring size (nodes) 

Figure 8: Cost of planar Ising inference methods on a ring graph, plotted against ring size. 

Left & center: CPU time on Apple MacBook with 2.2 GHz Intel Core2 Duo 
processor, averaged over 100 repetitions. Left: MAP state calculated via Blos- 
som IV (Cook and Rohe, 1999) on original, triangulated, and octangulated ring. 
Center: marginal edge probabilities with vs. without prefactoring. Right: size of 
K (double precision, no prefactoring) vs. prefactored bit matrix H, in uncom- 
pressed (solid lines) vs. compressed form (dashed lines), using row-compressed 
sparse matrix storage for K and bzip2 compression for H. 



4.3.2 Two-Phase Construction 

In a first phase we process the Ising model graph's structure into a Boolean "half-Kasteleyn" 
matrix H which does not yet include disagreement costs. For a given set of edge disagree- 
ment costs Ek,k = {1, 2, , . . . \£ |}, we then build from H the conventional, real- valued 
Kasteleyn matrix K by adding the exponentiated disagreement costs along the superdiag- 
onal and skew-symmetrizing: 

1. K := H. 

2. Vfc G {1, 2, , . . . \£\\ : K 2k -i,2k := K 2 jfe-i,2Jfe + e* 7 *; 

3. K \— K — K T . 

This two-phase approach holds a number of advantages: 

• When working with a large number of isomorphic graphs (as we do in Section 6), the 
corresponding half-Kasteleyn matrix is identical for all of them, hence needs to be 
constructed just once. 

• During maximum likelihood parameter estimation, partition function and/or marginals 
have to be recomputed many times for the same graph, with disagreement costs vary- 
ing due to the ongoing adaptation of the model parameters. H remains valid when 
disagreement costs change, so we can compute it just once upfront, then re-use it in 
the parameter estimation loop. 

• H can be stored very compactly as a prefactored bit matrix. As Figure 8 (right) 
shows, the uncompressed H can be several orders of magnitude smaller than the 



20 



Efficient Exact Inference in Planar Ising Models 



corresponding Kasteleyn matrix K. Row-compressed sparse storage of K (which has 
exactly 3 non-zero entries in each row and column) is more efficient, but applying the 
bzip2 compressor 3 to the prefactored bit matrix H yields by far the most compact 
storage format. Such memory efficiency becomes very important when working with 
large data sets of non-isomorphic graphs. 



Algorithm 4 Construct Half-Kasteleyn Bit Matrix 



Input: 


oriented, embedded, 




triangulated graph G(V,£',H) 


1. 


H : 


= 0G {0,l} 2 l f 'l x2 l £ 'l 


2. 


Vv e V: 




(a) 


e s := any edge incident on v; 




(b) 


if e s = (•, v) (edge points to v) 






then a := 2s; 






else a := 2s — 1; 




(c) 


i := tt v (s); 




(d) 


repeat 






if ei = (-, v) (edge points to v) 






then 






H2i-i,a •= 1; 






a := 2z; 






if ei was created by Algorithm 2 (plane triangulation) 






then i?2i-i,2i •= 1; 






else 






^2i,« := 1; 






o: := 2z — 1; 






i := tt,;(z); 






until i = 7r v (s); 


Output: 


Half-Kasteleyn bit matrix H 



4.3.3 Algorithm for Constructing H 

Using the above indexing scheme, Algorithm 4 constructs H in linear time, cycling once 
through the edges incident upon each vertex of the model graph. It deviates from the 
classical construction of K (Kasteleyn, 1961; Fisher, 1961; Globerson and Jaakkola, 2007) 
in that it makes only positive entries, and only those corresponding to edges with zero 
disagreement cost, i.e., added during plane triangulation or internal to the cliques of the 

3. http://www.bzip.org/ 
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expanded dual. All such entries have the value e° 
can be stored compactly as a bit matrix. 



1, making H a Boolean matrix, which 



4.4 Factoring Kasteleyn Matrices 

Standard approaches such as LU-factorization can be used to factor the Kasteleyn matrix 
but do not exploit its skew symmetry. Here we develop a numerically stable Cholesky- 
like factorization for even-sized skew-symmetric matrices, which can be used to factor K 
as well as to prefactor H (see below). Begin by writing the Kasteleyn matrix as 



(17) 








c 


a T 1 


K = 


— c 









—a 


-b 


C 



for some scalar c, vectors a and 6, and a matrix C which is either empty or again of the 
same form. We factor (17) into (cf. Bunch, 1982; Benner et al., 2000) 



(18) 








-1 


T 


K = 


1 





T 




a/c 


b/c 


I 






c 


o T - 


— c 





T 








a 






1 


a T /c 


-l 





b J /c 








I 



where C is the Schur complement 

C' := C + (ba T -ab T )/c. 



(19) 



Iterated application of (18) to the Schur complement ultimately yields K = R T JR, where 





' 1 

-1 


aj/ci 
&T/d 


R:= 






1 
-1 


a^/c 2 
&I/c 2 











1 








-1 






(20) 



and 



J := 



ci 










-ci 












c 2 









-c 2 























c\ e \ 










-c\e\ 



(21) 



To prevent small pivots C{ from causing numerical instability, pivoting is required. Since 
Kasteleyn matrices have at least two entries of unit magnitude in each row and column, 
partial pivoting suffices. 
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4.4.1 Partition Function 



The partition function for perfect matchings is (Kasteleyn, 1961; Fisher, 1961). Our 

factoring gives \R\ = 1 and | J| = ]\i c i > so we nave 

\S\ 



^\K\ = y\R T \ \ J\ \R\ = y/\J\ = Y[\ci\. (22) 

1=1 

Calculation of the product in (22) is prone to numerical overflow; this is easily avoided by 
working with logarithms instead. Using the complementary relationship (13) with graph 
cuts in planar Ising models, we obtain the log partition function for the latter as 

InZ := ln^e"^ fceC ^) Ek = In ^ e -^ke£ E k-T, k ^c( y ) E k) ^3) 
y y 

\e\ 

= ln(e"^*efi^fc^ e ^c(y)^fc) = ]ny/\K\ -^Ej, = J^(ln \a\ - £7<). 

y keS i=l 

4.4.2 Marginal Probabilities 

The marginal probability of disagreement along an edge equals the negative gradient of the 
log partition function (23) with respect to the disagreement costs. Computing this involves 
the inverse of K: 



-E(y) _ 1 dZ _ d\nZ 



Z dE k dE k 

y: keC(y) y. keC(y) 



2fc-l,2fc? 



(24) 



where tr denotes the matrix trace, and we have used the fact that K -1 is skew-symmetric 
as well. To invert K, observe from (20) resp. (21) that R and J are essentially triangular 
resp. diagonal (simply swap rows 2k — 1 and 2fc, k — 1,2,... and thus easily inverted. 
Then use K 1 = R- 1 J- 1 R~ T to obtain 

2\S\ 2\S\ \S\ 

[K^hk-l^k = ^2 [R~ 1 hk-l,i[J~ 1 }hj[R~ T }j,'2k = h ^2 ^ ik ' ^ 25 ^ 

i=l j=l ° k i=k+l 



where := 



[R 1 ]2/c-l,2i[^R 1 ]2/c,2z-l — [R 1 ]2/c-l,2z-l[^R 1 ]2/c,2i 



4.4.3 Prefactoring 

Consider the rows and columns of K corresponding to an edge added during plane trian- 
gulation (Section 4.1). Re-order K to bring those rows and columns to the top left, so that 
they form the a, 6, and c of (17). Since the disagreement cost of a triangulation edge is 
zero, we now have a unity pivot: c = e° = 1. This has two advantageous consequences: 

Size reduction: The unity pivot does not affect the value of the partition function. 
Since we are not interested in the marginal probability of triangulation edges (which after all 
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are not part of the original model), we do not need a or b either, once we have computed the 
Schur complement (19). We can therefore discard the first two rows and first two columns 
of K after factoring (17). Factoring out all triangulation edges in this fashion reduces the 
size of K (resp. R and J) to range only over the edges of the original Ising model graph. 
This reduces storage requirements and speeds up subsequent computation of the inverse 
(Figure 8, center). 

Boolean closure: The unity pivot eliminates the division from the Schur complement 
(19); in fact we show below that applying (18) to prefactor H — H T yields a Schur com- 
plement that can be expressed as H' — H'^ where H' is again a Boolean matrix. This 
closure property allows us to simply prefactor triangulation edges directly out of H without 
explicitly constructing K. 

Specifically, let K \— H — H T for a half-Kasteleyn matrix H with elements in {0, 1}. 
Without loss of generality, assume that H and its transpose are disjoint, i.e., have no 
non-zero element in common: H H J = 0, where denotes Hadamard (element-wise) 
multiplication. Algorithm 4 respects this condition; violations would cancel in the con- 
struction of K anyway. Expressing H as 








1 






H = 








bj 


(26) 




a 2 


b 2 







we can write K — H — H as (17) with a — a\ — a 2l b — b\ — b 2l c — 1, and C = C\ — Cj. 
The Schur complement (19) then becomes 

C" = Ci - Cj + (61 - b 2 )(a 1 - a 2 ) T - (ai - a 2 )(6i - b 2 ) T 

(Ci + b x al + 6 2 a2 + ai&J + a 2 bj) - (Cj + a^J + a 2 &2 + b 2 aj + 61 aj ) 
=: H'-H' T , (27) 

where 

H' = Cx + bxal + 6 2 aJ + ai&J + a 2 bj. (28) 

It remains to show that all elements of H' are in {0,1}. By definition of H (26), all 
elements of Ci, ai, a 2l 61, b 2 are in {0, 1}, and by closure of multiplication in {0, 1} so are 
their products. Thus an element of H f will be in {0, 1} iff it is non-zero in at most one term 
on the right-hand side of (28), or (equivalently) iff all pairs formed from the five terms in 
question are disjoint. This can indeed be shown to be the case: 

• Because H H T = 0, we know that neither a± and a 2 nor b\ and b 2 can have any 
non-zero elements in common, so b\ aj and 62^2 are disjoint, as are a\b^ and a 2 bj . 

• By construction of H (Algorithm 4), a\ and b\ {resp. a 2 and b 2 ) can only have 
an element in common if the dual nodes on both sides of the corresponding edge are 
members of the same clique. This cannot happen because we explicitly ensure that 
the graph becomes biconnected during plane triangulation (Section 4.1), so that an 
edge cannot border the same face of the model graph on both sides. Thus all four 
outer products in (28) are pairwise disjoint. 
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• Finally, each outer product in (28) is disjoint from C\ as long as the edges being 
factored out do not form a cut of the model graph (i.e., cycle of the dual). We 
are prefactoring only edges added during triangulation; for these to form a cut the 
model graph would have to have been disconnected prior to triangulation. This can- 
not happen here because we deal with (and eliminate) this possibility during earlier 
preprocessing (Section 2.3). 

In summary, all five terms on the right-hand side of (28) are pairwise disjoint. Thus the 
Schur complement H' is a Boolean matrix as well, and can be computed from H (26) very 
efficiently by replacing the additions and multiplications in (28) with bitwise OR and AND 
operations, respectively. As long as further triangulation edges remain in H', we then set 
H := H' and iteratively apply (26) and (28) so as to prefactor them out as well. 

5. Application to CRF Parameter Estimation 

Our algorithms can be applied to regularized maximum likelihood and maximum margin 
parameter estimation in conditional random fields (CRFs). In a standard planar Ising CRF, 
the disagreement costs in (2) are computed as Ek := T Xk, i-e., as inner products between 
local features (sufficient statistics) Xk of the modeled data at each edge fc, and corresponding 
parameters 6 of the model. The conditional distribution P(y|x, 6) (where x represents the 
union of all local features) is then modeled as a Markov random field (16). 

5.1 Maximum Likelihood 

Maximum-likelihood (ML) CRF parameter estimation seeks to minimize wrt. the L2- 
regularized negative log likelihood 

Lml(0) := ^A||0|| 2 - lnP(y>,0) 

= ^A||0|| 2 + E(y*\x,0) + lnZ(0\x) (29) 

of a given target labeling y*, 4 with regularization parameter A. This is a smooth, convex, 
non-negative objective that can be optimized via gradient methods such as LBFGS, either 
in conventional batch mode (Nocedal, 1980; Liu and Nocedal, 1989) or online (Schraudolph 
et al., 2007). The gradient of (29) with respect to the parameters is given by 

^Lml(0) = XO + ^ ([fc G C(i/*)] -F(keC(y\x,0)f)x k . (30) 

kes 

The contribution of each edge k to the gradient (30) is given by the product between its 
local features x^ and the difference between the indicator function for membership of k in 
the cut induced by the target state and the marginal probability of k being contained in 
a cut, given x and 0. We compute the latter via the inverse of the Kasteleyn matrix (24). 

4. For notational clarity we suppress here the fact that we are usually modeling a collection of data items; 
the objective function for such a set is simply the sum of objectives for each individual item in it. 
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5.2 Maximum Margin 

For maximum-margin (MM) parameter estimation (Taskar et al., 2004) we instead minimize 

Lmm(0) ■■= |A||6>|| 2 + E(y*\x,0)- mm M(y\y*x,0) (31) 

y 

= ±A||0|| 2 + E(y* \x,0) -E(y\x,6) + d(y\y*), 

where y := argmin y M(y\y* x, 0) is the worst margin violator, i.e., the state that minimizes, 
relative to a given target state y*, 4 the margin energy 

M(y\y*) := E(y) - d(y\y*), (32) 

where d(-\-) is a measure of divergence in state space. If d(-\-) is a weighted Hamming 
distance between induced cuts: 

d(y\y*) := J> G C(y)] + [k e C(y*)]] v kl (33) 

kes 

where the v k > are constant weighting factors (in the simplest case: all ones) on the edges 
of our Ising model, then it is easily verified that the margin energy (32) is implemented (up 
to a shift that depends only on y*) by an isomorphic Ising model with disagreement costs 

E k + (2[k€C(y*)]-l)v k . (34) 

We can thus use our algorithm of Section 3.3 to efficiently find the worst margin violator. 5 
The maximum- margin objective (31) is convex but non-smooth; its gradient is 

—L MM (0) = XO + 5^([fceC(y*)] -[keC(y)])x k , (35) 

kes 

i.e., local features x k are multiplied by one of { — 1,0, 1}, depending on the membership of 
edge k in the cuts induced by and respectively. We can minimize (31) via bundle 
methods, such as the BT bundle trust algorithm (Schramm and Zowe, 1992), making use 
of the convenient lower bound V0: Lmm(#) > 0, which holds because 

minM(y\y*x,0) < M(y*\y*x,0) = E(y*\x,0) - d(y*\y*) = E(y*\x,0). (36) 
y 

6. Experiments 

We now demonstrate the scalability of our approach to CRF parameter estimation (Sec- 
tion 5) on two simple computer vision problems: the synthetic binary image denoising task 
of Kumar and Hebert (2004, 2006), and the detection of segmentation boundaries in noisy 
masks from the GrabCut Ground Truth image segmentation database (Rother et al., 2007a). 

5. Thomas and Middleton (2007) employ a similar approach to obtain the ground state from a given state 
y* by setting up an isomorphic Ising model with disagreement costs (1 — 2 [k G C(y*)])Ek- 
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6.1 Synthetic Binary Image Denoising 

Kumar and Hebert (2004, 2006) developed an image denoising benchmark problem for 
binary random fields based on four hand-drawn 64x64 pixel images (Figure 9, top row). 
We created 50 instances of each image corrupted with pink noise, produced by convolving a 
white noise image (all pixels i.i.d. uniformly random) with a Gaussian density of one pixel 
standard deviation. Original and pink noise images were linearly mixed using signal-to- 
noise (S/N) amplitude ratios of l:n, n G N. Figure 9 shows samples of the resulting noisy 
instances for S/N ratios of 1:5 (second row) and 1:6 (fourth row). 

We then employed square grid planar Ising CRFs to denoise the images, with edge 
energies set to Eij := ([1, \xi — Xj\), 0), where X{ is the pixel intensity at node i. The 
perimeter of the grid was connected to a bias node with constant input and label: xo = 
Ho = 0. Bias edges had their own parameters, yielding CRFs with four parameters and up 
to (for a 64x64 grid) 4097 nodes and 8316 edges. 

These systems were trained by maximum margin (MM) and maximum likelihood (ML) 
parameter estimation (Section 5) on the 50 noisy instances derived from the first image 
(Figure 9, left column) only. The gradient (35) resp. (30) was computed by adding the 
contributions from all 50 training instances to that of the regularizer, with A = 100. We as- 
sessed the quality of the obtained parameters by determining (via the method of Section 3.3) 
the maximum a posteriori (MAP) states 

?/* := argmaxP(y|x, 0) — argmin£'(y|x, 0) (37) 
y y 

of the trained CRF for all 150 noisy instances of the other three images. Interpreting these 
MAP states as attempted reconstructions of the original images, we then calculated the 
prediction error rates for both nodes and edges of the model. 

All experiments were carried out on a Linux PC with 3 GB RAM and dual Intel Pentium 
4 processors running at 3.6 GHz, each with 2MB of level 2 cache. 

6.1.1 Noise Level 

We first explored the limit of the ability of a full-size (64x64) MM-trained Ising grid CRF 
to reconstruct the test images as the noise level increases. Rows 3 and 5 of Figure 9 show 
the reconstructions obtained from the noisy instances shown in rows 2 and 4, respectively. 
At low noise levels (n < 5) we obtained perfect reconstruction of the original images. At 
an S/N ratio of 1:5 the first subtle errors do creep in (Figure 9, third row), though less 
than 0.5% of the nodes (0.3% of the edges) are predicted incorrectly. At the 1:6 S/N ratio, 
these figures increase to 2.1% for nodes and 1.15% for edges, and the errors become far 
more noticeable (Figure 9, bottom row). For higher noise levels (n > 6) the reconstructions 
rapidly deteriorate as the noise finally overwhelms the signal. At these noise levels our 
human visual system was no longer able to accurately reconstruct the images either. 

6.1.2 Maximum Margin vs. Maximum Likelihood Parameter Estimation 

Next we compared MM and ML parameter estimation at the S/N ratio of 1:6 (fourth row 
of Figure 9), where reconstruction begins to break down and any differences in performance 
should be evident. To make ML training computationally feasible, we subdivided each 
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Figure 9: Denoising of binary images by maximum-margin training of planar Ising grids; 

from top to bottom: original images (Kumar and Hebert, 2004, 2006), images 
mixed with pink noise in a 1:5 ratio, reconstruction via MAP state of 64x64 
Ising grid CRF from those noisy instances, images mixed with pink noise in a 1:6 
ratio, and reconstruction from those noisy instances. Only the left-most image 
was used for training (max-margin CRF parameter estimation, A = 100), the 
others for reconstruction. 
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Table 1: Performance comparison of parameter estimation methods on the denoising task; 

image reconstruction via MAP on the full (64x64) model and images. The mini- 
mum in each result column is boldfaced. 



Train Method 


Patch Size 


Train Time 


Edge Error 


Node Error 


MM 


64 x 64 


490.4 s 


1.15% 


2.10% 


32 x 32 


174.7s 


1.16% 


2.15% 


16 x 16 


91.4s 


1.12% 


1.98% 


8x8 


78.1s 


1.10% 


1.83 % 


ML 


5468.2 s 


1.11% 


1.93% 



training image into 64 8x8 patches, then trained an 8x8 grid CRF on those patches. For 
MM training we used the full (64x64) images and model, as well as 32x32, 16x16, and 8x8 
patches, so as to assess how this subdividision impacts the quality of the model. Testing 
always employed the MAP state of the full model on the full images. 

Table 1 reports the edge and node errors obtained under each experimental condition. 
To assess statistical significance, we performed binomial pairwise comparison tests at a 95% 
confidence level against the null hypothesis that each of the two algorithms being compared 
has an equal (50%) chance of outperforming the other on a given test image. 

We found no statistically significant difference here between 8x8 CRFs trained by MM 
vs. ML. However, ML training took 70 times as long to achieve this, so MM training is 
much preferable on computational grounds. 

Counter to our expectations, the node and edge errors suggest that MM training actually 
works better on small (8x8 and 16x16) image patches. We surmise that this is because 
small patches have a relatively larger perimeter, leading to better training of the bias edges. 
Pairwise comparison tests, however, only found the node error for the 32x32 patch-trained 
CRF to be significantly worse than for the smaller patches; all other differences were below 
the significance threshold. What we can confidently state is that subdividing the images 
into small patches did not hurt performance, and yielded much shorter training times. 

6.1.3 Maximum A Posteriori vs. Marginal Posterior Reconstruction 

Fox and Nicholls (2001) have argued that the MAP state does not summarize well the 
information in the posterior distribution of an Ising model of noisy binary images, and 
proposed reconstruction via the marginal posterior mode (MPM) instead. For binary labels, 
the MPM is simply obtained by thresholding the marginal posterior node probabilities: 
(Vz E V) yi \— \P(yi = l\x,0) > 0.5]. In our Ising models, however, we have marginal 
posterior probabilities for edges (Section 4.4.2), and infer node states from graph cuts 
(Algorithm 1). Here implementing the MPM runs into a difficulty: since the edge set 

{k e £ : P(fc e C(y\x, 0)) > 0.5} (38) 

may not be a cut of the model graph, hence may not unambiguously induce a node state. 
What we really need is the cut closest (in some given sense) to (38). For this purpose we 
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Table 2: Comparison of reconstruction methods on the denoising task, using the parameters 
of an MM-trained 8x8 CRF. The minimum in each result column is boldfaced; 
"test time" is the time required to reconstruct all 150 images. 



Test Method 


Patch Size 


Test Time 


Edge Error 


Node Error 


MAP 


64x64 


3.7s 


1.10% 


1.83% 


8x8 


3.2 s 


1.96% 


5.21 % 


M 3 P 


397.5 s 


1.95% 


3.32 % 



formulate the state with the maximal minimum marginal posterior (M 3 P): 

: = argmax mm . _;. n ) \ " _ . 7 (39) 
yf kes \ 1 -P(fc e C(y\x,9)) otherwise. v 7 

In other words, the M 3 P state (39) is induced by the cut whose edges (and those of its 
complement) have the largest minimum marginal probability. We can efficiently compute 
y+ as follows: 

1. Find the minimum-weight spanning tree £ + C £ of the model graph G(V,£) with 
edge weights -|P(fe e C(y\x, 0)) - 0.5|. (This can be done in 0{\£\ log \£\) time.) 

2. Run Algorithm 1 on G(V,£ + ) to find y~^~ as the state induced in the spanning tree 
by the MPM edge set (38). 

Since G(V, £ + ) is a tree, it contains no cycles, and Algorithm 1 will therefore unambiguously 
identify the M 3 P state. 

Table 2 lists the reconstruction errors achieved on the denoising task via MAP vs. M 3 P 
states, using the parameters of the MM-trained 8x8 CRF which gave the best performance 
in Section 6.1.2. (We obtained comparable results for ML training on 8x8 patches and MM 
training on full images.) Figure 10 shows representative sample reconstructions. 

When reconstructing 8x8 patches, both MAP and M 3 P states appear to achieve virtu- 
ally the same edge error. The binomial pairwise comparison test, however, reveals M 3 P to 
consistently outperform MAP here, even at a 99 % confidence level. This trend is confirmed 
by M 3 P's substantially lower node error. For comparison, an oracle that always picks the 
better of the two label-symmetric node states would yield a node error of 2.8% here; the 
excess node error relative to that baseline is almost 5 times lower for M 3 P than for MAP. 
This does confirm the limitations of the MAP state for reconstruction, as argued by Fox 
and Nicholls (2001). 

bmce the M 3 P state (39) requires calculation of the edge marginals, however, it takes 
over two order of magnitude longer to obtain than the MAP state on an 8x8 patch, and is 
computationally infeasible for us to compute for the entire 64x64 image. The MAP state 
on the entire image clearly outperforms any reconstruction from 8x8 patches in terms of 
both edge and node error. The impressive scalability of the MAP state computation via 
blossom-shrinking (Section 3.3) thus in practice overrules here the theoretical advantage of 
reconstruction based on marginal probabilities. 



30 



Efficient Exact Inference in Planar Ising Models 





cort con 

V!S VK 





Figure 10: Image reconstructions on the denoising task; columns from left to right: Noisy 
64x64 image, MAP reconstruction of full image, MAP reconstruction of 8x8 
patches, and M 3 P reconstruction of 8x8 patches. 



6.2 Boundary Detection 

To further scale up our approach, we designed a simple boundary detection task based 
on images from the GrabCut Ground Truth image segmentation database (Rother et al., 
2007a). We took 100x100 pixel subregions of images that depicted a segmentation bound- 
ary, and corrupted the segmentation mask with pink noise, produced by convolving a white 
noise image (all pixels i.i.d. uniformly random) with a Gaussian density with one pixel 
standard deviation. 

We then employed a planar Ising model to recover the original boundary — namely, 
a 100x100 square grid with one additional edge pegged to a high energy, encoding prior 
knowledge that the bottom left and top right corners of the grid depict different regions. 
As before, the energy of the other edges was set to := ([1, \x{ — 0), where X{ is the 
pixel intensity at node i. We did not employ a bias node for this task, and simply set the 
regularization constant to A = 1. 

Note that this is a huge model: 10 000 nodes and 19 801 edges. Computing the partition 
function or marginals would require inverting a Kasteleyn matrix with over 1.5 • 10 9 entries; 
minimizing (29) is therefore computationally infeasible for us. Computing a ground state 
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Figure 11: Boundary detection by maximum- margin training of planar Ising grids; columns 
from left to right: original image, original segmentation mask, noisy mask for 
S/N ratio of 1:8 (top two rows) resp. 1:7 (bottom two rows), and MAP segmen- 
tation obtained from the Ising grid CRF trained on the noisy mask. 



via the algorithm described in Section 3, by contrast, takes only 0.3 CPU seconds on an 
Apple MacBook with 2.2 GHz Intel Core2 Duo processor. 6 We can therefore efficiently 
minimize (31) to obtain the MM parameter vector 0*, then compute the CRF's MAP (i.e., 
ground) state for rapid prediction. 

As before, we titrated for the smallest S/N ratio of the form 1 :n for which we obtained 
a good segmentation; depending on the image this occurred for n — 7 or 8. Figure 11 (right 

6. In scalability tests our code was able to compute the MAP state of a 400x400 grid in 0.85 CPU seconds. 
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column) shows that at these noise levels our approach is capable of recovering the original 
segmentation boundary quite well, with less than 1% of nodes mislabeled. For S/N ratios 
of 1:9 and lower the system was unable to locate the boundary; for S/N ratios of 1:6 and 
higher we obtained perfect reconstruction. Again this corresponded closely to our human 
ability to visually locate the segmentation boundary accurately. 

7. Discussion and Outlook 

We have proposed an alternative algorithmic framework for efficient exact inference in 
binary graphical models, which replaces the submodularity constraint of graph cut methods 
with a planarity constraint. Besides proving efficient and effective in first experiments, our 
approach opens up a number of interesting research directions to be explored: 

Our algorithms can all be extended to nonplanar graphs, at a cost exponential in the 
genus of the embedding. We are currently developing these extensions, which may prove of 
great practical value for graphs that are "almost" planar; examples include road networks 
(where edge crossings arise from overpasses without on-ramps) and graphs describing the 
tertiary structure of proteins (Vishwanathan et al., 2007). 

Our algorithms also provide a foundation for efforts to develop efficient approximate 
inference methods for nonplanar Ising models. Our method for calculating the ground 
state (Section 3) actually works for nonplanar graphs whose ground state does not contain 
frustrated non-contractible cycles. The QPBO graph cut method (Kolmogorov and Rother, 
2007) finds ground states that do not contain any frustrated cycles, and otherwise yields a 
partial labeling. Can we likewise obtain a partial labeling of ground states with frustrated 
non-contractible cycles? 

The existence of two distinct tractable frameworks for inference in binary graphical 
models implies a more powerful hybrid: Consider a graph each of whose biconnected com- 
ponents is either planar or submodular. In its entirety, this graph may be neither planar nor 
submodular, yet efficient exact inference in it is clearly possible by applying the appropriate 
framework to each component, then combining the results (Section 2.4.2). Can this hybrid 
approach be extended to cover less obvious situations? 
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